DASC 500: Introduction to Data Analytics

A Gentle Introduction to Supervised Learning Algorithms

Maj Jason Freels, PhD

29 January 2020

Overview

Supervised vs Unsupervised Learning Algorithms

Linear regression with Python

Supervised vs Unsupervised Learning Algorithms

So, when would an unsupervised learning algorithm be used?

Supervised learning algorithms can be divided into two main classes:

Regression algorithms: when the output/response variable is numeric and continuous

Classification algorithms: when the output/response variable is discrete or categorical
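The distinction above comes down to the type of the response variable. As a minimal illustrative sketch (using made-up toy data, not anything from the lecture), a quick check of the response values tells you which class of algorithm applies:

```python
# Toy data (illustrative only): the type of the response variable
# determines which class of supervised algorithm applies.
house_prices = [215.5, 180.0, 342.9]    # numeric, continuous -> regression
email_labels = ["spam", "ham", "spam"]  # discrete categories -> classification

def problem_type(response):
    """Classify a supervised learning problem by its response variable."""
    numeric = all(isinstance(v, (int, float)) for v in response)
    return "regression" if numeric else "classification"

print(problem_type(house_prices))  # regression
print(problem_type(email_labels))  # classification
```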

Model Complexity vs. Model Accuracy

The big picture of supervised learning algorithms

Common elements of supervised learning algorithms

Supervised Learning Example

Elements of supervised learning algorithms: #1 Data

Plot of some ideal data


Elements of supervised learning algorithms: #2 an assumed form of \(f_{\text{imperfect}}\)

Fitting the ideal data with a "perfect" model


Elements of supervised learning algorithms: #3 parameter values

Elements of supervised learning algorithms: #4 loss functions

Supervised Learning Example (Cont.)

Visualizing loss functions

A naive loss function

\[ Loss_{_{naive}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N y_i-m\times x_i-b. \]

def naive_loss(params, x, y):

  if len(params) != 2: sys.exit("Need 2 parameters -- dummy!")

  m, b = params

  loss = 0

  # sum the signed (not squared or absolute) errors
  for i in range(len(y)):
    loss = loss + (y[i] - m * x[i] - b)

  return loss

from scipy.optimize import minimize
import numpy as np
import sys

x = np.arange(0,6,0.5)
y = x * 5 + 3

res = minimize(naive_loss,
               x0 = [0,0], 
               args = (x,y))

res
      fun: -12693503877.002071
 hess_inv: array([[1, 0],
       [0, 1]])
      jac: array([0., 0.])
  message: 'Optimization terminated successfully.'
     nfev: 64
      nit: 1
     njev: 16
   status: 0
  success: True
        x: array([3.39728820e+08, 1.23537753e+08])
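The optimizer reports "success", yet the fitted slope and intercept are absurdly large. The reason is that the naive loss is linear in \(m\) and \(b\): positive and negative errors cancel, so the sum of signed errors can be driven toward \(-\infty\) and the loss has no minimum at all. A short sketch (restating the loss in vectorized form) makes this visible:

```python
import numpy as np

def naive_loss(params, x, y):
    # sum of signed errors -- linear in m and b, hence unbounded below
    m, b = params
    return np.sum(y - m * x - b)

x = np.arange(0, 6, 0.5)
y = x * 5 + 3

# Increasing m just makes the loss more and more negative;
# there is no minimum for the optimizer to find.
losses = [naive_loss([m, 0], x, y) for m in (5, 50, 500)]
print(losses)  # strictly decreasing, without bound
```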



A good loss function

\[ Loss_{_{absolute}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N \Big\vert y_i-m\times x_i-b\Big\vert. \]

def absolute_loss(params, x, y):

  if len(params) != 2: sys.exit("Need 2 parameters -- dummy!")

  m, b = params

  loss = 0

  # sum the absolute errors so positive and negative errors cannot cancel
  for i in range(len(y)):
    loss = loss + np.abs(y[i] - m * x[i] - b)

  return loss

from scipy.optimize import minimize
import numpy as np
import sys

x = np.arange(0,6,0.5)
y = x * 5 + 3

res = minimize(absolute_loss,
               x0 = [0,0], 
               args = (x,y))

res
      fun: 3.135918174024255e-07
 hess_inv: array([[ 2.22446360e-09, -1.33536208e-08],
       [-1.33536208e-08,  8.83478893e-08]])
      jac: array([ 2.31926751, -1.050596  ])
  message: 'Desired error not necessarily achieved due to precision loss.'
     nfev: 471
      nit: 20
     njev: 115
   status: 2
  success: False
        x: array([4.99999998, 3.00000003])
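Notice that the estimates are essentially correct (\(m \approx 5\), \(b \approx 3\)) even though the optimizer reports `success: False` with a "precision loss" warning. The absolute value is not differentiable at zero, and near the optimum every residual is close to zero, so the gradient-based optimizer struggles right where it matters. A small sketch (a hypothetical finite-difference helper, not part of the lecture code) shows the kink:

```python
# Finite-difference derivative of |r|: well-behaved away from zero,
# unreliable near the kink at r = 0 where the true derivative jumps
# from -1 to +1.
def grad_abs(r, h=1e-8):
    return (abs(r + h) - abs(r - h)) / (2 * h)

print(grad_abs(1.0))   # ~ +1 in the smooth region
print(grad_abs(-1.0))  # ~ -1 in the smooth region
print(grad_abs(1e-9))  # far from +/-1: the difference straddles the kink
```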



A better loss function

\[ Loss_{_{convex}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N \Big( y_i-m\times x_i-b\Big)^2. \]

def convex_loss(params, x, y):

  if len(params) != 2: sys.exit("Need 2 parameters -- dummy!")

  m, b = params

  loss = 0

  # sum the squared errors -- smooth and convex, so gradient-based
  # optimizers converge quickly
  for i in range(len(y)):
    loss = loss + (y[i] - m * x[i] - b)**2

  return loss

from scipy.optimize import minimize
import numpy as np
import sys

x = np.arange(0,6,0.5)
y = x * 5 + 3

res = minimize(convex_loss,
               x0 = [0,0], 
               args = (x,y))

res
      fun: 1.81798298894111e-14
 hess_inv: array([[ 0.01398601, -0.03846154],
       [-0.03846154,  0.1474359 ]])
      jac: array([-8.88152632e-07, -7.45724819e-07])
  message: 'Optimization terminated successfully.'
     nfev: 28
      nit: 5
     njev: 7
   status: 0
  success: True
        x: array([5.        , 2.99999997])



How can I know which algorithm to choose?

Linear Regression

Overview

Replication Requirements

Importing libraries

import random
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm
import statistics as st

Understanding the data

Accessing the data

data_url = "http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv"

df = pd.read_csv(data_url)
df.head()
   Unnamed: 0     TV  radio  newspaper  sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9
df2 = pd.read_csv(data_url,
                  usecols = ['TV','radio','newspaper','sales'])

df2.head()
      TV  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9

Preparing Our Data

Partitioning the data

random.seed(42)  # fix the seed so the train/test split is reproducible

rows = range(df2.shape[0])

train_rows = random.sample(rows, 
                           int(0.6 * len(rows)))

test_rows = list(set(rows) - set(train_rows))

train_data, test_data = df2.iloc[train_rows], df2.iloc[test_rows]

train_data.shape
(120, 4)
test_data.shape
(80, 4)
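It is worth confirming that this really is a partition: every row lands in exactly one of the two sets. A self-contained sanity-check sketch (assuming the 200-row Advertising data, and re-creating the split locally):

```python
import random

# Recreate the 60/40 split on 200 row indices, as on the previous slide
rows = range(200)
train_rows = random.sample(rows, int(0.6 * len(rows)))
test_rows = list(set(rows) - set(train_rows))

# The two index sets are disjoint and together cover every row
assert set(train_rows).isdisjoint(test_rows)
assert set(train_rows) | set(test_rows) == set(rows)

print(len(train_rows), len(test_rows))  # 120 80
```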

Simple linear regression - 1

Overview

Simple linear regression - 2

Building the model

\[ \begin{aligned} \boldsymbol{y} &= \beta_0 + \beta_{1}\boldsymbol{X} + \epsilon\\\\ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} &= \begin{bmatrix} 1& X_{1}\\ 1& X_{2}\\ \vdots & \vdots \\ 1& X_{m} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix}+ \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{m} \end{bmatrix} \end{aligned} \]

X = train_data[["TV"]]
y = train_data["sales"]
X = sm.add_constant(X)
model = sm.OLS(y, X)

model_fit = model.fit()
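Under the hood, fitting this model minimizes the squared-error loss from earlier, which has the closed-form solution \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\). A hedged sketch on synthetic data (illustrative values, not the Advertising data) shows the closed form recovering known coefficients:

```python
import numpy as np

# Synthetic data: y = 3 + 5x plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])  # design matrix [1, X]
y = 3 + 5 * x + rng.normal(scale=0.5, size=x.size)

# Solve the least-squares problem (equivalent to the normal equations)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [3, 5]
```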

Simple linear regression - 3

Checking the assumptions

Assumption #1: The model is linear in parameters

Assumption #2: The mean of the residuals is zero

## model residuals
model_residuals = model_fit.resid

st.mean(model_residuals)
-4.51490696680897e-16
plot = sb.residplot(x=model_fit.fittedvalues, 
                    y="sales", 
                    data=train_data,
                    lowess=True,
                    scatter_kws={'alpha': 0.5},
                    line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot.set_title('Residuals vs Fitted')
plot.set_xlabel('Fitted values')
plot.set_ylabel('Residuals');

plt.show()

Assumption #3: Constant variance among the residuals

df2 = pd.read_csv(data_url,
                  usecols = ['TV','radio','newspaper','sales'])

## apply a transformation on the TV column
df2["TV"] = (df2["TV"])**2

# apply a transformation on the sales column
df2["sales"] = np.sqrt(df2["sales"])

train_data, test_data = df2.iloc[train_rows], df2.iloc[test_rows]

X = train_data[["TV"]]
y = train_data["sales"]

X = sm.add_constant(X)

model = sm.OLS(y, X)

model_fit = model.fit()

## model residuals
model_residuals = model_fit.resid

st.mean(model_residuals)

plot = sb.residplot(x=model_fit.fittedvalues, 
                    y="sales",
                    data=train_data,
                    lowess=True,
                    scatter_kws={'alpha': 0.5},
                    line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot.set_title('Residuals vs Fitted')
plot.set_xlabel('Fitted values')
plot.set_ylabel('Residuals');

plt.show()

Assumption #4: The residuals are normally distributed

from statsmodels.graphics.gofplots import ProbPlot

model_norm_residuals = model_fit.get_influence().resid_studentized_internal

QQ = ProbPlot(model_norm_residuals)

plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)

plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');

# annotations
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]

for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i,
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));
plt.show()

Understand the results of a regression model output

print(model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.595
Model:                            OLS   Adj. R-squared:                  0.591
Method:                 Least Squares   F-statistic:                     173.2
Date:                Wed, 29 Jan 2020   Prob (F-statistic):           6.80e-25
Time:                        12:50:22   Log-Likelihood:                -311.21
No. Observations:                 120   AIC:                             626.4
Df Residuals:                     118   BIC:                             632.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.2470      0.599     12.101      0.000       6.061       8.433
TV             0.0459      0.003     13.159      0.000       0.039       0.053
==============================================================================
Omnibus:                        0.520   Durbin-Watson:                   2.131
Prob(Omnibus):                  0.771   Jarque-Bera (JB):                0.511
Skew:                          -0.153   Prob(JB):                        0.775
Kurtosis:                       2.907   Cond. No.                         345.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Using the model to predict results for new inputs

X_new = sm.add_constant(test_data[["TV"]])

# make the predictions with the fitted model;
# copy first to avoid pandas' SettingWithCopyWarning
test_data = test_data.copy()
test_data["predictions"] = model_fit.predict(X_new) 

test_data
        TV  radio  newspaper  sales  predictions
1     44.5   39.3       45.1   10.4     9.289676
3    151.5   41.3       58.5   18.5    14.201343
9    199.8    2.6       21.2   10.6    16.418479
14   204.1   32.9       46.0   19.0    16.615864
15   195.4   47.7       52.9   22.4    16.216504
..     ...    ...        ...    ...          ...
188  286.0   13.9        3.7   15.9    20.375356
189   18.7   12.1       23.4    6.7     8.105367
190   39.5   41.1        5.8   10.8     9.060158
191   75.5   10.8        6.0    9.9    10.712682
195   38.2    3.7       13.8    7.6     9.000484

[80 rows x 5 columns]